Automatic email classification
Abstract
The ever-increasing volume of unsolicited email (a.k.a. spam) has become a growing concern. Its costs range from a daily loss of time for the end user, who must keep her mailbox clean, to a financial loss for ISPs, which constantly need more bandwidth and disk space. According to a recent study, MSN and AOL together discard almost five billion such emails every day [2]. Several technical solutions, mainly deterministic, have been proposed to counter this plague. The most widely used are black lists, white lists and score-based filters [7, 3]. Black lists use a large database of email servers known to be owned by spammers and block any email coming from them. White lists allow only emails from known senders to go through. Score-based filters assign points to each incoming email according to its contents, e.g. 5 points when the word “free” is present; emails with high scores are then blocked. However, the effectiveness of such techniques is limited, and spammers have already found ways of bypassing them.
A few years ago a new solution was proposed: Bayesian classification [3, 4, 6, 5]. The results shown in these papers and the number of related software projects currently under development [7] seem to indicate a large potential, which this project will investigate. I propose to tackle this issue with two improvements in mind. Firstly, the adaptation of the boosting chain procedure used by some face detection algorithms [8]. This should speed up the overall procedure by adopting a simple-to-complex strategy, with complex classifiers applied only to emails not classified by simpler ones. At the same time it might decrease the false positive rate. Secondly, the creation of a language-independent classifier. Being from France, I receive emails in both French and English. Using a single classifier on all emails seems quite inefficient considering the vastly different word statistics and grammatical rules of the two languages. Language independence will be handled in the most general way, to allow the classifier to deal with any language. However, the lack of data and personal knowledge will restrict the study of the generalization to Polish.
If these two improvements prove successful, a third will be addressed: the generalization of the classifier to n classes. Current classifiers have only two classes, namely “spam” and “not spam”. A more interesting algorithm would also handle classes such as “friends”, “family”, “work”, etc., thus leading to a true automatic email classifier.
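The simple-to-complex strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the boosting chain of [8] and not the project's implementation: it is a plain hand-built cascade in which a cheap score-based stage decides only the easy cases and a small naive Bayes classifier handles the rest. The keyword weights, the threshold, the toy training emails and all function names (`score_stage`, `NaiveBayes`, `cascade`) are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical keyword weights and threshold, chosen only for illustration.
KEYWORD_POINTS = {"free": 5, "winner": 4, "viagra": 8}
BLOCK_THRESHOLD = 8


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_stage(text):
    """Cheap first stage: block high-scoring emails, otherwise stay undecided."""
    score = sum(KEYWORD_POINTS.get(tok, 0) for tok in tokenize(text))
    if score >= BLOCK_THRESHOLD:
        return "spam"
    return None  # undecided: defer to the next, more complex stage


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing, for any number of classes."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of tokens seen in training
        self.doc_counts = Counter()  # label -> number of training emails
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods of each token
            logp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokenize(text):
                logp += math.log((counts[tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


def cascade(text, stages):
    """Apply stages from simple to complex; the first stage that decides wins."""
    for stage in stages:
        label = stage(text)
        if label is not None:
            return label
    return "not spam"  # default when no stage reaches a decision


# Toy training data, invented for this sketch.
nb = NaiveBayes()
nb.train("cheap pills buy now limited offer", "spam")
nb.train("claim your prize money now", "spam")
nb.train("meeting agenda for the project tomorrow", "not spam")
nb.train("lunch with the family on sunday", "not spam")

stages = [score_stage, nb.classify]
print(cascade("you are a winner get free money", stages))
print(cascade("project meeting tomorrow", stages))
```

Because `NaiveBayes` keeps one token counter per label, training it with further labels such as “friends” or “work” would extend the same sketch toward the n-class case, and training one instance per language would be one possible way to keep per-language word statistics separate.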
Similar resources
Predicting The Type of Malaria Using Classification and Regression Decision Trees
Maryam Ashoori (School of Technical and Engineering, Higher Educational Complex of Saravan, Saravan, Iran), Fatemeh Hamzavi (School of Agriculture, Higher Educational Complex of Saravan, Saravan, Iran). Background: Malaria is an infectious disease infecting 200-300 million people annually. Environme...
An Automatic Fingerprint Classification Algorithm
Manual fingerprint classification algorithms are very time consuming, and usually not accurate. Fast and accurate fingerprint classification is essential to each AFIS (Automatic Fingerprint Identification System). This paper investigates a fingerprint classification algorithm that reduces the complexity and costs associated with the fingerprint identification procedure. A new structural algorit...
Dimensionality Reduction and Improving the Performance of Automatic Modulation Classification using Genetic Programming (RESEARCH NOTE)
This paper shows how genetic programming can be used to select suitable features for automatic modulation recognition. Automatic modulation recognition is one of the essential components of modern receivers. In this regard, the selection of suitable features may significantly affect the performance of the process. Simulations were conducted with SNRs of 5 dB and 10 dB. Test and ...
Naïve-Bayes vs. Rule-Learning in Classification of Email
Recent growth in the use of email for communication and the corresponding growth in the volume of email received has made automatic processing of email desirable. Two learning methods, naïve Bayesian learning with bag-valued features and the RIPPER rule-learning algorithm, have shown promise in other text categorization tasks. I present three experiments in automatic mail foldering and spam filt...
Email Folder Classification using Threads
While automatic classification of email is obviously a useful task to study, it is not obvious how to best utilize the rich metadata specific to email to improve the quality of the classification. In this paper, we propose a simple algorithm for using email threads to improve the precision of a personal email assistant’s automatic folder classification. We evaluate the approach on a large email...
Publication date: 2003